 noisy document



I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Neural Information Processing Systems

Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name.
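As a rough illustration of the unsupervised alternative mentioned above, the sketch below classifies an image embedding by cosine similarity to word embeddings of its candidate class names. The class names, the 300-dimensional random vectors, and the cosine helper are placeholders standing in for a real image encoder and pretrained word embeddings; this is not I2DFormer's image-to-document attention mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these would come from a pretrained
# word-embedding model (for the class names) and a trained image encoder.
class_names = ["zebra", "otter", "walrus"]
class_embeddings = {name: rng.normal(size=300) for name in class_names}
image_embedding = rng.normal(size=300)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Zero-shot prediction: pick the class whose name embedding is closest to the image embedding.
scores = {name: cosine(image_embedding, emb) for name, emb in class_embeddings.items()}
print(max(scores, key=scores.get), scores)
```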


MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation

Chang, Chia-Yuan, Jiang, Zhimeng, Rakesh, Vineeth, Pan, Menghai, Yeh, Chin-Chia Michael, Wang, Guanchu, Hu, Mingzhi, Xu, Zhichao, Zheng, Yan, Das, Mahashweta, Zou, Na

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, existing RAG systems frequently struggle with the quality of retrieved documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
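A minimal sketch of this filtering idea, under stated assumptions: several judge agents score each retrieved document, scores are averaged as a simple stand-in for inter-agent consensus, and a threshold derived from the per-query score distribution (here, the mean of the aggregated scores) decides which documents to keep. The llm_agent_score function is a hypothetical placeholder for an LLM relevance judge, and the threshold rule is illustrative, not MAIN-RAG's exact design.

```python
from statistics import mean

def llm_agent_score(query: str, document: str) -> float:
    # Placeholder relevance score in [0, 1]; a real agent would be an LLM asked
    # to judge how well `document` supports answering `query`.
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def filter_documents(query, documents, n_agents=3):
    # In practice each agent would be a distinct LLM or prompt; here they share one scorer.
    aggregated = [mean(llm_agent_score(query, d) for _ in range(n_agents)) for d in documents]
    threshold = mean(aggregated)  # adaptive: depends on this query's score distribution
    return [d for d, s in zip(documents, aggregated) if s >= threshold]

docs = [
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
    "France's capital city is Paris, on the Seine.",
]
print(filter_documents("What is the capital of France?", docs))
```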


Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Yu, Sangwon, Kim, Ik-hwan, Song, Jongyoon, Lee, Saehyung, Park, Junsung, Yoon, Sungroh

arXiv.org Artificial Intelligence

Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs' performance is also sensitive to the order in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context to ensure the supporting documents are presented in the optimal order for the model. Using CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known "lost-in-the-middle" problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.
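A minimal sketch of the context-repetition idea, assuming a simple prompt template: the retrieved documents are concatenated into one context block and that block is repeated before the question, so that for any pair of supporting documents some copy of the context presents them in the order the model needs. The build_core_prompt helper and its wording are illustrative assumptions, not the paper's exact prompt.

```python
def build_core_prompt(documents, question, repetitions=2):
    # Number each retrieved document and join them into one context block.
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(documents))
    # Repeat the block so that, across copies, supporting documents also appear
    # in the reversed relative order (e.g. B before A as well as A before B).
    repeated = "\n\n".join([context] * repetitions)
    return f"{repeated}\n\nQuestion: {question}\nAnswer:"

docs = ["Lyon is a city in France.", "Alice was born in Lyon."]
print(build_core_prompt(docs, "In which country was Alice born?"))
```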


Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Choi, Juhwan, Yun, Jungmin, Jin, Kyohoon, Kim, YoungBin

arXiv.org Artificial Intelligence

The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought (CoT) and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
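A minimal sketch of the majority-voting part of such a cleansing pipeline, under stated assumptions: each candidate document is judged related or unrelated several times and kept only on a majority of "related" votes. The llm_is_related function is a hypothetical stand-in for a chain-of-thought LLM call (whose sampled votes would actually vary); only the voting logic is meant to be illustrative.

```python
from collections import Counter

def llm_is_related(topic: str, document: str, seed: int) -> bool:
    # Placeholder: a real implementation would prompt an LLM with CoT reasoning,
    # and `seed`/temperature would make individual votes differ.
    return bool(set(topic.lower().split()) & set(document.lower().split()))

def keep_document(topic: str, document: str, n_votes: int = 5) -> bool:
    votes = Counter(llm_is_related(topic, document, seed) for seed in range(n_votes))
    return votes[True] > votes[False]  # simple majority vote

print(keep_document("earthquake in Chile", "A strong earthquake struck central Chile."))
print(keep_document("earthquake in Chile", "Stock markets rallied on tech earnings."))
```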


Benchmarking Large Language Models in Retrieval-Augmented Generation

Chen, Jiawei, Lin, Hongyu, Han, Xianpei, Sun, Le

arXiv.org Artificial Intelligence

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.
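As a rough illustration (not the official RGB harness), the sketch below shows how two of the abilities above might be scored: exact-match accuracy on answerable instances and a rejection rate on negative-rejection instances where the model should answer "unknown". The case format, the evaluate helper, and the toy model are assumptions for illustration.

```python
def exact_match(prediction: str, answers) -> bool:
    return prediction.strip().lower() in {a.strip().lower() for a in answers}

def evaluate(cases, model_fn):
    correct = rejections = negatives = 0
    for case in cases:
        pred = model_fn(case["question"], case["documents"])
        if case["answers"]:                      # answerable instance
            correct += exact_match(pred, case["answers"])
        else:                                    # negative-rejection instance
            negatives += 1
            rejections += pred.strip().lower() == "unknown"
    answerable = len(cases) - negatives
    return {"accuracy": correct / max(answerable, 1),
            "rejection_rate": rejections / max(negatives, 1)}

toy_cases = [
    {"question": "What is the capital of France?",
     "documents": ["Paris is the capital of France."], "answers": ["Paris"]},
    {"question": "What is the capital of Atlantis?",
     "documents": ["Unrelated text."], "answers": []},
]
print(evaluate(toy_cases, lambda q, docs: "Paris" if "France" in q else "unknown"))
```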


Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models

Yu, Wenhao, Zhang, Hongming, Pan, Xiaoman, Ma, Kaixin, Wang, Hongwei, Yu, Dong

arXiv.org Artificial Intelligence

Retrieval-augmented language models (RALMs) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. However, the reliability of the retrieved information is not always guaranteed. The retrieval of irrelevant data can lead to misguided responses and may cause the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. Moreover, standard RALMs often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. In situations where knowledge is lacking, these systems should ideally respond with "unknown" when the answer is unattainable. In response to these challenges, we introduce Chain-of-Noting (CoN), a novel approach aimed at improving the robustness of RALMs when facing noisy, irrelevant documents and when handling unknown scenarios. The core idea of CoN is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. We employed ChatGPT to create training data for CoN, which was subsequently used to train a LLaMA-2 7B model. Our experiments across four open-domain QA benchmarks show that RALMs equipped with CoN significantly outperform standard RALMs. Notably, CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
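A minimal sketch of note-then-answer prompting in the spirit of CoN, under stated assumptions: the model is asked to write a brief reading note per retrieved document before giving a final answer, and to reply "unknown" when neither the documents nor its own knowledge suffice. The build_con_prompt wording is an illustrative template, not the paper's exact prompt or its ChatGPT-generated training data.

```python
def build_con_prompt(question, documents):
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return (
        "For each retrieved document below, write one brief reading note stating "
        "whether it is relevant to the question and what it contributes. Then give "
        "a final answer, or reply 'unknown' if the documents and your own knowledge "
        "are insufficient.\n\n"
        f"Documents:\n{numbered}\n\n"
        f"Question: {question}\n\n"
        "Notes and final answer:"
    )

docs = ["The Eiffel Tower was completed in 1889.", "Bananas are rich in potassium."]
prompt = build_con_prompt("When was the Eiffel Tower completed?", docs)
print(prompt)  # this prompt would then be sent to the RALM's underlying LLM
```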


Automatic Authorship Attribution of Noisy Documents

Sayoud, Halim (University of Sciences and Technology Houari Boumediene (USTHB)) | Khennouf, Salah (University of Sciences and Technology Houari Boumediene (USTHB)) | Benzerroug, Hocine (Independent Researcher) | Hamadache, Zohra (University of Sciences and Technology Houari Boumediene (USTHB)) | Hadjadj, Hassina (University of Sciences and Technology Houari Boumediene (USTHB)) | Ouamour, Siham (University of Sciences and Technology Houari Boumediene (USTHB))

AAAI Conferences

In this work, we investigate the robustness of several features and classifiers for automatic authorship attribution. Our corpus consists of 25 documents written in English by 5 American philosophers. The documents are converted into grey-scale images, and several levels of noise are added to corrupt these images. The noise is of the "Salt & Pepper" type, randomly added over the surface of the images at the following levels: 0%, 1%, 2%, 3%, 4%, 5%, 6% and 7%. Each image is then passed through an OCR (Optical Character Recognition) program to extract its text, and the resulting text documents are used in the authorship attribution experiments. Several features and classifiers are employed and evaluated with regard to classification performance. Results show that the most robust feature for authorship attribution is the character tetragram, which achieves a score of 100% even at a noise level of 7%.
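A minimal sketch of character-tetragram authorship attribution, the feature reported above as most robust to OCR noise: each text is reduced to its character 4-gram frequency profile and attributed to the author whose training profile is most similar under cosine similarity. The toy texts, the simulated OCR corruption, and the nearest-profile classifier are illustrative assumptions, not the paper's exact experimental setup.

```python
from collections import Counter
from math import sqrt

def tetragram_profile(text: str) -> Counter:
    text = " ".join(text.lower().split())  # light normalisation of case and whitespace
    return Counter(text[i:i + 4] for i in range(len(text) - 3))

def cosine(p: Counter, q: Counter) -> float:
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy "training" texts standing in for one clean document per author.
profiles = {
    "author_a": tetragram_profile("Liberty of thought is the parent of all progress in society."),
    "author_b": tetragram_profile("The machine computes the table of values without human help."),
}

# A test document with simulated OCR corruption (character substitutions).
noisy_test = "L1berty of th0ught is the parent of a11 progress."
test_profile = tetragram_profile(noisy_test)
print(max(profiles, key=lambda a: cosine(test_profile, profiles[a])))
```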